Introduction


Overview

For this project, we simulate a scenario in which Team Outlier is a team of data scientists hired by the Capital Bikeshare company to produce data-driven answers that support its business decisions.

The following business questions are addressed using the two-year historical log covering 2011 and 2012.

Use Case 1 - Operational Expenses Analysis

The company needs to reduce operational expenses by matching staffing levels to extra demand. Team Outlier will therefore predict the total ridership count from a set of candidate variables. To support operational expense planning, the model will address the following questions:

  • Does season play a significant role in the bike ridership count?
  • Do daily environmental factors such as weather, temperature, or humidity affect the bike ridership count, so that Capital Bikeshare can adjust its human capital accordingly?
  • Do holidays, weekdays, and weekends play an important role in predicting the ridership count?

Use Case 2 - Targeted Marketing Strategy Analysis

Capital Bikeshare would like a targeted marketing strategy that helps it spend its budget efficiently. Team Outlier needs to determine the recommended season for running promotional offers to increase the number of customers.

The bike-sharing rental process is highly correlated with environmental and seasonal conditions. For instance, weather, precipitation, day of week, season, and hour of the day can all affect rental behavior. The core data set is the publicly available two-year historical log (2011 and 2012) from the Capital Bikeshare system in Washington, D.C., USA. Our source (the UCI Machine Learning Repository) aggregated the data on both an hourly and a daily basis and added the corresponding weather and seasonal information, with weather data extracted from http://www.freemeteo.com.

There are two datasets – hour.csv and day.csv. Both have the same fields except hr, which is not present in day.csv. hour.csv contains bike sharing counts aggregated on an hourly basis (17379 records), while day.csv contains counts aggregated on a daily basis (731 records).


Methods


Use Case 1 - Operational Expenses Analysis

Loading the Dataset

library(readr)
day_data = read.csv("dataset/day.csv")
#head(day_data)

hour_data = read.csv("dataset/hour.csv")
#head(hour_data)

Data Cleaning

  • Change Numeric Variables to Factor Variables
# Convert Season to Factor
day_data$season = as.factor(day_data$season)
# Set Seasons
levels(day_data$season) = c("Spring", "Summer", "Fall", "Winter")

# Convert Holiday to Factor
day_data$holiday = as.factor(day_data$holiday)
# Set Holiday
levels(day_data$holiday) = c("No", "Yes")

# Convert WorkingDay to Factor
day_data$workingday = as.factor(day_data$workingday)
levels(day_data$workingday) = c("No", "Yes")

# Convert Weather to Factor
day_data$weathersit = as.factor(day_data$weathersit)
levels(day_data$weathersit) = c("Clear", "Mist", "LightPrecip")

# Convert Weekday to Factor
day_data$weekday = as.factor(day_data$weekday)
levels(day_data$weekday) = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")  

# Convert Month to Factor
day_data$mnth = as.factor(day_data$mnth)
levels(day_data$mnth) = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", 
                          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec") 

# Remove any rows with missing values
day_data = na.omit(day_data)
hour_data = na.omit(hour_data)

Data Exploration

Before moving on to modelling, we explore the data set to uncover initial patterns, characteristics, and points of interest. This exploratory analysis looks at the distributions of individual predictors, relationships between predictors and the response, and correlations and interactions among predictors as they relate to the response.

The detailed exploratory analysis is included in the Appendix.

Model the Data

We first remove variables that are unique to each row, such as the instance index and the date, since they do not help predict total ridership. We also drop the casual and registered counts, which sum exactly to cnt and would therefore leak the response.

data = day_data[, c(-1, -2, -14, -15)]  # remove instant, date, casual, and registered (not needed for our model)

Define Functions

Function to calculate LOOCV-RMSE

calc_loocv_rmse = function(model) {
  temp = (resid(model) / (1 - hatvalues(model))) ^ 2
  temp = temp[is.finite(temp)]
  sqrt(mean(temp))
}
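The function above relies on the standard leverage shortcut for leave-one-out residuals: dividing each residual by \(1 - h_i\) gives exactly the error that would be made if observation \(i\) were held out. As a sanity check, a minimal sketch on simulated data confirms it matches explicit refitting:

```r
# Verify the hat-value shortcut against explicit leave-one-out refits
set.seed(42)
n = 50
x = runif(n)
y = 3 + 2 * x + rnorm(n)
fit = lm(y ~ x)

# Shortcut: e_i / (1 - h_i) is exactly the i-th deleted residual
rmse_shortcut = sqrt(mean((resid(fit) / (1 - hatvalues(fit))) ^ 2))

# Brute force: refit n times, predicting each held-out point
loo_errors = sapply(seq_len(n), function(i) {
  fit_i = lm(y ~ x, subset = -i)
  y[i] - predict(fit_i, newdata = data.frame(x = x[i]))
})
rmse_brute = sqrt(mean(loo_errors ^ 2))

all.equal(rmse_shortcut, rmse_brute)  # TRUE
```

This identity holds exactly for linear models fit by least squares, which is why the one-pass formula is preferred over n refits.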

Separate Numeric and Categorical Variables.

numerical = unlist(lapply(data, is.numeric)) # logical vector indicating which columns are numeric

Modelling with Numeric Predictors Only

Let's begin by fitting a model using only the numerical predictors:

data_numerical = data[, numerical] # get all the numerical columns
bike_mod_num = lm(cnt ~ ., data = data_numerical) # model with all numeric variables

# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_num)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.7261
calc_loocv_rmse(bike_mod_num) # get the loocv rmse
## [1] 1042

This model has a high cross-validated RMSE and a modest adjusted \(R^2\), so it does not explain the response very well.

Let’s now examine the p-values for the coefficients of the model:

summary(bike_mod_num)$coefficients[, 'Pr(>|t|)']
## (Intercept)          yr        temp       atemp         hum   windspeed 
##   7.548e-19  6.263e-109   2.938e-01   4.506e-03   1.193e-15   7.786e-15

The summary above shows that temp is not a significant predictor, as it has a high p-value; this may be due to collinearity between temp and atemp.

par(mfrow = c(2, 2))
plot(bike_mod_num, col = 'dodgerblue') # diagnostic plots

The Residuals vs Fitted plot does not show constant variance, violating an assumption of the linear regression model. The leverage plot also highlights some outliers.

Modelling with Numeric and Categorical Predictors

bike_mod_all = lm(cnt ~ ., data = data) # modelling with all the variables

# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8423
calc_loocv_rmse(bike_mod_all) # get the loocv rmse
## [1] 794.8

Including the categorical variables substantially improves the adjusted \(R^2\) and lowers the LOOCV-RMSE.

P-values for coefficients for the model:

summary(bike_mod_all)$coefficients[, 'Pr(>|t|)'] # get the coefficient p-values
##           (Intercept)          seasonSummer            seasonFall 
##             9.775e-10             1.032e-06             1.025e-04 
##          seasonWinter                    yr               mnthFeb 
##             2.280e-17            2.295e-154             3.624e-01 
##               mnthMar               mnthApr               mnthMay 
##             1.085e-03             6.882e-02             6.145e-03 
##               mnthJun               mnthJul               mnthAug 
##             6.842e-02             9.219e-01             1.426e-01 
##               mnthSep               mnthOct               mnthNov 
##             1.652e-04             3.179e-02             6.133e-01 
##               mnthDec            holidayYes            weekdayMon 
##             6.231e-01             1.130e-03             5.319e-02 
##            weekdayTue            weekdayWed            weekdayThu 
##             3.982e-03             4.136e-04             3.498e-04 
##            weekdayFri            weekdaySat        weathersitMist 
##             5.296e-05             4.007e-05             3.156e-09 
## weathersitLightPrecip                  temp                 atemp 
##             5.386e-22             4.153e-02             2.223e-01 
##                   hum             windspeed 
##             2.014e-07             2.092e-11

The p-values of all the parameters of the additive model indicate that not all of the variables are significant.

Diagnostic plots for the model:

par(mfrow = c(2, 2))
plot(bike_mod_all, col = 'dodgerblue') # diagnostic plots

By including categorical variables also in the model, we still see some issues in the diagnostic plots.

  • The Residuals vs Fitted plot shows a non-linear trend and non-constant variance, violating the assumptions of the linear regression model. We also see some extreme outliers in the plot.

  • We see some fat tails in the Normal Q-Q plot.

  • The Residuals vs Leverage plot also indicates the presence of some outliers, which we will revisit later in the analysis.

Based on the above results, we will examine the significance of several categorical variables in explaining the response:

  • Month
  • Week Day
  • Working Day
bike_mod_w_month = lm(cnt ~ . - mnth, data = data) # model without month
bike_mod_w_weekday = lm(cnt ~ . - weekday, data = data) # model without weekday
bike_mod_w_workingday = lm(cnt ~ . - workingday, data = data) # model without workingday

#  anova test to compare above three models
anova(bike_mod_w_month,
      bike_mod_w_weekday,
      bike_mod_w_workingday,
      bike_mod_all) 
## Analysis of Variance Table
## 
## Model 1: cnt ~ (season + yr + mnth + holiday + weekday + workingday + 
##     weathersit + temp + atemp + hum + windspeed) - mnth
## Model 2: cnt ~ (season + yr + mnth + holiday + weekday + workingday + 
##     weathersit + temp + atemp + hum + windspeed) - weekday
## Model 3: cnt ~ (season + yr + mnth + holiday + weekday + workingday + 
##     weathersit + temp + atemp + hum + windspeed) - workingday
## Model 4: cnt ~ season + yr + mnth + holiday + weekday + workingday + weathersit + 
##     temp + atemp + hum + windspeed
##   Res.Df       RSS Df Sum of Sq     F  Pr(>F)    
## 1    713 472452877                               
## 2    707 428442679  6  44010198 12.40 2.7e-13 ***
## 3    702 415401688  5  13040991  4.41 0.00059 ***
## 4    702 415401688  0         0                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test shows that month and weekday are significant predictors, so we cannot rule them out. Even though the month variable is statistically significant as a whole, it may be that only a few of its levels are useful; we will use model selection schemes later to investigate this.

According to the test results, the working day variable is not significant, so we can rule it out. (The zero-Df row in the ANOVA table shows that the model without workingday has exactly the same residual sum of squares as the full model: workingday is determined by the holiday and weekday variables, so it adds no information.)
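The three reduced models above are each nested within the full model, but not within one another, so the consecutive rows of a chained anova() call should be read with care. A clearer pattern is to compare each reduced model against the full model in its own call; a minimal sketch with simulated data and illustrative names:

```r
# Each reduced model is nested in the full model, so it is compared
# against the full model one F-test at a time
set.seed(1)
n = 200
df = data.frame(x1 = rnorm(n),
                grp = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
df$y = 1 + 2 * df$x1 + ifelse(df$grp == "b", 3, 0) + rnorm(n)

full   = lm(y ~ x1 + grp, data = df)
no_x1  = lm(y ~ grp, data = df)
no_grp = lm(y ~ x1, data = df)

anova(no_x1, full)   # F-test for x1 given grp
anova(no_grp, full)  # F-test for grp given x1
```

Each call tests whether the dropped term adds explanatory power beyond what the remaining terms provide.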

We will now re-fit the model excluding working day:

data_2 = data[, c(-6)] # remove working day variable
bike_mod_all_2 = lm(cnt ~ ., data = data_2) # model with all remaining variable

# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all_2)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8423
calc_loocv_rmse(bike_mod_all_2) # get the loocv rmse
## [1] 794.8

Excluding working day had no effect on the adjusted \(R^2\) or the LOOCV-RMSE of the model, so we can safely remove this variable.

Identifying Collinearity in Model

We now use the variance inflation factor (VIF) to determine whether the model has multicollinearity issues:

# install.packages("car")
library(car)
## Loading required package: carData
car::vif(bike_mod_all_2)
##               GVIF Df GVIF^(1/(2*Df))
## season     169.905  3           2.353
## yr           1.047  1           1.023
## mnth       408.760 11           1.314
## holiday      1.121  1           1.059
## weekday      1.163  6           1.013
## weathersit   1.895  2           1.173
## temp        80.807  1           8.989
## atemp       70.037  1           8.369
## hum          2.140  1           1.463
## windspeed    1.273  1           1.128

The VIF results above confirm what the p-values already suggested: temp and atemp are highly collinear. We will now look at the partial correlation coefficient between temp and cnt:

temp_model_1 = lm(temp ~ . - cnt, data = data_2)
temp_model_2 = lm(cnt ~ . - temp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.07684

While this is relatively small, since temp and atemp are highly correlated we should recheck the partial correlation coefficient after removing atemp:

temp_model_1 = lm(temp ~ . - cnt - atemp, data = data_2)
temp_model_2 = lm(cnt ~ . - temp - atemp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.3801

Let us observe the partial correlation coefficient between atemp and cnt, removing temp:

temp_model_1 = lm(atemp ~ . - cnt - temp, data = data_2)
temp_model_2 = lm(cnt ~ . - atemp - temp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.3758

These results indicate that temp is more correlated with cnt than atemp is, so we will remove atemp and leave temp:

data_3 = data_2[, -8]
bike_mod_all_3 = lm(cnt ~ ., data = data_3)

# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all_3)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8422
calc_loocv_rmse(bike_mod_all_3) # get the loocv rmse
## [1] 789.3

This change has slightly lowered the adjusted \(R^2\) as we would expect removing a predictor would, but has improved the LOOCV-RMSE, which indicates that it has improved the model.

As we have concerns about the year predictor, we will also look at the partial correlation coefficient between year and count:

yr_mod_0 = lm(cnt ~ . - yr, data = data_3)
yr_mod_1 = lm(yr ~ . - cnt, data = data_3)
cor(resid(yr_mod_0), resid(yr_mod_1))
## [1] 0.7943

The partial correlation coefficient is quite high, which indicates that year has a significant relationship with ridership. We have seen that ridership increases from year to year (with seasonal cycles within that trend). Since our data includes only two years, this may cause problems when extrapolating to years not in the training set. However, the high partial correlation coefficient indicates that year is an important predictor, so we will keep it in the model.
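The residual-on-residual calculation used throughout this section is the standard definition of a partial correlation: regress each variable of interest on the other predictors and correlate what is left over. A minimal sketch with simulated data, where a third variable z drives both x and y:

```r
set.seed(7)
n = 500
z = rnorm(n)
x = z + rnorm(n, sd = 0.5)
y = z + rnorm(n, sd = 0.5)

cor(x, y)  # inflated by the shared dependence on z

# Partial correlation of x and y given z: correlate the residuals
# after regressing each variable on z
r_x = resid(lm(x ~ z))
r_y = resid(lm(y ~ z))
cor(r_x, r_y)  # near zero once z is accounted for
```

The raw correlation is large purely because both variables track z; the partial correlation reveals there is no direct relationship between x and y.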

Outlier Diagnostics in the Model

Next we check for potential outliers; there are three common strategies:

  • Leverage
  • Standardized residuals
  • Cook's distance

We will use Cook's distance to identify influential observations and assess their effect on the model.

First, we calculate the number of observations flagged by Cook's distance:

sum(cooks.distance(bike_mod_all_3) > 4 / length(cooks.distance(bike_mod_all_3)))
## [1] 44

The next step is to refit the model excluding the identified observations:

cokks_distance = cooks.distance(bike_mod_all_3)
bike_mod_all_4 = lm(cnt ~ .,
                    data = data_3,
                    subset = cokks_distance <= 4 / length(cokks_distance))

# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all_4)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9086
calc_loocv_rmse(bike_mod_all_4) # get the loocv rmse
## [1] 586.6

Removing these influential observations resulted in a substantial increase in adjusted \(R^2\) and a substantial decrease in LOOCV-RMSE.

par(mfrow = c(2, 2))
plot(bike_mod_all_4,col = 'dodgerblue') # Create diagnostics

Looking at the Residuals vs Fitted plot, the variance now looks fairly constant, although we still see some non-linear patterns, which suggests trying higher-order terms or transformations.

The Residuals vs Leverage plot also looks much neater and the Normal Q-Q plot has also improved.
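The 4/n cutoff applied above is a common rule of thumb rather than a formal test. A minimal sketch with simulated data and a single injected outlier shows the mechanics:

```r
set.seed(11)
n = 100
x = runif(n)
y = 5 + 3 * x + rnorm(n, sd = 0.3)
y[n] = y[n] + 10  # inject one gross outlier

fit = lm(y ~ x)
cd = cooks.distance(fit)

which(cd > 4 / n)  # the injected point is flagged (a few borderline points may be too)

# Refit excluding the flagged points and compare slope estimates
refit = lm(y ~ x, subset = cd <= 4 / n)
c(coef(fit)["x"], coef(refit)["x"])
```

After excluding the flagged points, the slope estimate sits close to the true value of 3 used in the simulation.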

Impact of Interactions in the Model

We will now evaluate including interactions:

bike_mod_all_5 = lm(cnt ~ . ^ 2,
                    data = data_3,
                    subset = cokks_distance <= 4 / length(cokks_distance))

# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all_5)[["adj.r.squared"]]
## [1] 0.9467

Including all possible interactions resulted in a substantial improvement in adjusted \(R^2\). However, adjusted \(R^2\) alone cannot tell us whether this improvement reflects real structure or merely the inclusion of many additional predictors.

calc_loocv_rmse(bike_mod_all_5) # get the loocv rmse
## [1] 759.6

The LOOCV-RMSE has increased, which indicates that the additional terms may have overfit the data.

par(mfrow = c(2, 2))
plot(bike_mod_all_5,col = 'dodgerblue') # do diagnostics

length(coef(bike_mod_all_5)) # get the number of params
## [1] 305

The Residuals vs Fitted plot looks random, with errors distributed around the zero line, although some outliers are still highlighted on the plot.

The Q-Q plot looks more normal.

The leverage plot shows some indication of potential new outliers.

Model Selection

Since the model including all possible interactions increased the LOOCV-RMSE, indicating overfitting to the training data, we will perform a backward AIC step search to remove non-significant terms.

bike_mod_all_6 = step(bike_mod_all_5, trace = 0, direction = "backward")
length(coef(bike_mod_all_6)) # get the no of params
## [1] 117
summary(bike_mod_all_6)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9482

While decreasing the number of predictors from 305 to 117, the backwards step search also resulted in a very small improvement in adjusted \(R^2\).

calc_loocv_rmse(bike_mod_all_6) # get the loocv rmse
## [1] 470.9

The LOOCV-RMSE also improved significantly, which indicates that this smaller model is a much better model for prediction.

par(mfrow = c(2, 2))

plot(bike_mod_all_6,col = 'dodgerblue') # do diagnostics

The Residuals vs Fitted plot still looks well behaved.
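step() with direction = "backward" repeatedly drops the term whose removal most improves AIC, stopping when no single deletion helps. A minimal sketch with simulated data containing two informative predictors and two pure-noise predictors:

```r
set.seed(3)
n = 150
df = data.frame(x1 = rnorm(n), x2 = rnorm(n),
                noise1 = rnorm(n), noise2 = rnorm(n))
df$y = 2 * df$x1 - 1.5 * df$x2 + rnorm(n)

full = lm(y ~ ., data = df)
reduced = step(full, trace = 0, direction = "backward")

# The informative predictors survive; the noise terms are typically dropped
names(coef(reduced))
```

On a model with hundreds of interaction terms, as above, the same procedure prunes the formula term by term rather than coefficient by coefficient, which is why whole factors and interactions disappear at once.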

Polynomial Fitting

We have previously seen some issues which indicated that polynomial features might improve the model. We will now evaluate including them:

temp_mod = lm(
  cnt ~ . + I(temp ^ 2) + I(hum ^ 2) + I(windspeed ^ 2),
  data = data_3,
  subset = cokks_distance <= 4 / length(cokks_distance)
)
bike_mod_all_7 = step(temp_mod, trace = 0, direction = "backward")
summary(bike_mod_all_7)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9203

The adjusted \(R^2\) is lower than it was for our interaction model.

calc_loocv_rmse(bike_mod_all_7) # get the loocv rmse
## [1] 548.5

The LOOCV-RMSE has also increased.

par(mfrow = c(2, 2))
plot(bike_mod_all_7,col = 'dodgerblue') # do diagnostics

The Residuals vs Fitted plot does not look random and shows some non-linear pattern, which indicates that these terms alone have not improved the model. However, \(temp^2\) has a very low p-value, suggesting it may still be useful; we will keep this in mind.

Transformations

Let us evaluate taking a log transformation of the response.

temp_m = lm(log(cnt) ~ . ^ 2,
            data = data_3,
            subset = cokks_distance <= 4 / length(cokks_distance))
bike_mod_all_8 = step(temp_m, trace = 0, direction = "backward")
summary(bike_mod_all_8)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9379

The adjusted \(R^2\) is lower after this transformation, though note that \(R^2\) values computed on the log scale are not directly comparable to those from models fit on the original scale.

par(mfrow = c(2, 2))
plot(bike_mod_all_8,col = 'dodgerblue') # do diagnostics

In addition, we observed issues in both the Residuals vs Fitted plot and the Normal Q-Q plot. We can conclude that this transformation was not helpful.
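A related caveat with a log-transformed response: to compare error against models of cnt itself, predictions must be back-transformed to the original scale first. A minimal sketch with simulated multiplicative-error data:

```r
set.seed(5)
n = 120
x = runif(n, 1, 10)
y = exp(1 + 0.3 * x + rnorm(n, sd = 0.2))  # truly multiplicative errors

log_fit = lm(log(y) ~ x)

# Back-transform fitted values before computing error on the original scale
rmse_original = sqrt(mean((y - exp(fitted(log_fit))) ^ 2))
rmse_original
```

A fuller treatment would also correct for retransformation bias (e.g. multiplying by \(\exp(\hat{\sigma}^2/2)\) under normal errors), which we leave aside here.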

Interactions with Polynomial Terms

Finally, since some of the polynomial terms seemed as if they could be significant, we will try a model which includes interactions and polynomial terms:

bike_mod_int_poly = lm(
  cnt ~ (. ^ 2) + I(temp ^ 2) + I(hum ^ 2) + I(windspeed ^ 2),
  data = data_3,
  subset = cokks_distance <= 4 / length(cokks_distance)
)
bike_mod_all_10 = step(bike_mod_int_poly, trace = 0, direction = "backward")
summary(bike_mod_all_10)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9498

The adjusted \(R^2\) is the best we have found yet.

calc_loocv_rmse(bike_mod_all_10) # get the loocv rmse
## [1] 461.4

And this model also results in the lowest LOOCV-RMSE.

Finally, we refit the model using the full data set, identify the influential observations via Cook's distance, and refit the model excluding them:

bike_mod_int_poly_full = lm(cnt ~ (. ^ 2) + I(temp ^ 2) + I(hum ^ 2) + I(windspeed ^ 2),
                            data = data_3)
bike_mod_all_11 = step(bike_mod_int_poly_full,
                       trace = 0,
                       direction = "backward")
cooks_distance = cooks.distance(bike_mod_all_11) # find influential observations and exclude them
filter = cooks_distance <= (4 / length(cooks_distance))
# some points have a cooks distance of NA, we will consider these to be outliers
filter[is.na(filter)] <- FALSE
bike_mod_int_poly_full = lm(cnt ~ (. ^ 2) + I(temp ^ 2) + I(hum ^ 2) + I(windspeed ^ 2),
                            data = data_3,
                            subset = filter)
bike_mod_all_11 = step(bike_mod_int_poly_full,
                       trace = 0,
                       direction = "backward")
summary(bike_mod_all_11)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9582
calc_loocv_rmse(bike_mod_all_11) # get the loocv rmse
## [1] 426.8

Identifying and removing the influential observations has improved both the LOOCV-RMSE and the adjusted \(R^2\).

Use Case 2 - Targeted Marketing Strategy Analysis

library(readr)
day_data = read.csv("dataset/day.csv")
# knitr::kable(head(day_data)[, 1:15])
#head(day_data)
hour_data = read.csv("dataset/hour.csv")
# knitr::kable(head(hour_data)[, 1:15])
#head(hour_data)

Filter the data based on season

# spring filtered data
spring_data = subset(day_data, day_data$season == 1)
# summer filtered data
summer_data = subset(day_data, day_data$season == 2)
# fall filtered data
fall_data   = subset(day_data, day_data$season == 3)
# winter filtered data
winter_data = subset(day_data, day_data$season == 4)

Remove Possible NA

# remove NA
spring_data = na.omit(spring_data)
summer_data = na.omit(summer_data)
fall_data = na.omit(fall_data)
winter_data = na.omit(winter_data)

Quick Comparison

Let us take a quick look with the filtered data.

# plot to see the difference 
par(mfrow=c(2,2))
hist(spring_data$cnt, col = "dodgerblue",border = "darkorange", xlab = "count", main = "Spring Daily Rental Bike Summary")
hist(summer_data$cnt, col = "dodgerblue",border = "darkorange", xlab = "count", main = "Summer Daily Rental Bike Summary")
hist(fall_data$cnt, col = "dodgerblue",border = "darkorange", xlab = "count", main = "Fall Daily Rental Bike Summary")
hist(winter_data$cnt, col = "dodgerblue",border = "darkorange", xlab = "count", main = "Winter Daily Rental Bike Summary")

Collinearity Check

We remove the instant and dteday variables, since they are only reference indices, and then use the cor() and pairs() functions to see which predictors are highly correlated.

day_data_converted = day_data[3:16]
# convert all factor variables to numeric in order to call cor()
for(name in colnames(day_data_converted)) {
  if (is.integer(day_data_converted[[name]]))
    day_data_converted[[name]] = as.numeric(day_data_converted[[name]])
}

Let us visually check the correlations between the predictors. Two pairs of variables stand out immediately: (1) temp and atemp, and (2) registered and cnt.

library(faraway)
pairs(day_data_converted, col = "dodgerblue")

Now let us check using the cor() function, keeping predictors whose absolute correlation with cnt exceeds 0.5:

all_cor = round(cor(day_data_converted), 2)
(correlated_predictors = sort(abs(all_cor["cnt", abs(all_cor["cnt", ]) > 0.5]), decreasing = TRUE)[-1])
## registered     casual       temp      atemp         yr 
##       0.95       0.67       0.63       0.63       0.57

Based on the cor() results, the predictors above should be helpful in predicting the recommended season for running promotional offers to meet the company's goal of increasing the number of customers.
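The same filtering step can be expressed compactly; a minimal sketch on simulated data with illustrative names, keeping predictors whose absolute correlation with the response exceeds 0.5:

```r
set.seed(9)
n = 100
df = data.frame(strong = rnorm(n), weak = rnorm(n))
df$cnt = 2 * df$strong + rnorm(n)

all_cor = cor(df)
cnt_cor = all_cor["cnt", colnames(df) != "cnt"]

# Keep predictors whose absolute correlation with the response exceeds 0.5
names(cnt_cor)[abs(cnt_cor) > 0.5]  # only "strong" passes the filter
```

Note that marginal correlation screening can miss predictors that matter only jointly; it is a first pass, not a substitute for model-based selection.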

Predictor Selection

We now search for the best model we can use. First, we build a model from the highly correlated predictors identified above, along with a full model for comparison. Both have essentially identical, very high adjusted \(R^2\) values; this is unsurprising, since cnt is by definition the sum of casual and registered, so any model containing both fits the response almost perfectly.

library(knitr)
library(kableExtra)
cor_model = lm(cnt ~ registered + casual + temp + atemp + yr, data = day_data_converted)
summary(cor_model)$adj.r.squared
## [1] 1
full_day_model = lm(cnt ~ ., data = day_data_converted)
summary(full_day_model)$adj.r.squared
## [1] 1
comp_results = data.frame(
  "Correlated Model" = c(
    "Adjusted R-squared" = summary(cor_model)$adj.r.squared,
    "R-Squared"          = summary(cor_model)$r.squared
  ),
  "Full Model" = c(
    "Adjusted R-squared" = summary(full_day_model)$adj.r.squared,
    "R-Squared"          = summary(full_day_model)$r.squared
  )
)
kable(comp_results) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
                   Correlated.Model  Full.Model
Adjusted R-squared                1           1
R-Squared                         1           1

Both models have high adjusted \(R^2\). However, we want to check whether any predictors have a high variance inflation factor (VIF), which quantifies the effect of collinearity on the variance of our regression estimates. It is also interesting to compare the VIFs of the two models.

sort(vif(cor_model)[vif(cor_model) > 5], decreasing = TRUE)
## atemp  temp 
## 61.54 60.59
sort(vif(full_day_model)[vif(full_day_model) > 5], decreasing = TRUE)
##      atemp       temp registered 
##     64.806     63.650      5.993

Based on the above comparison, atemp and temp are a consistent concern in both models because of their severe collinearity.
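A predictor's VIF is \(1 / (1 - R^2)\) from regressing that predictor on all the others, so it can be computed in base R without the car package. A minimal sketch with two deliberately collinear simulated predictors:

```r
set.seed(13)
n = 200
temp_sim  = rnorm(n)
atemp_sim = temp_sim + rnorm(n, sd = 0.1)  # near-duplicate of temp_sim
other     = rnorm(n)

# VIF = 1 / (1 - R^2) from regressing the target on the other predictors
vif_manual = function(target, others) {
  r2 = summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_manual(temp_sim, data.frame(atemp_sim, other))  # very large
vif_manual(other, data.frame(temp_sim, atemp_sim))  # close to 1
```

The near-duplicate pair produces a VIF far above the usual threshold of 5, mirroring the temp/atemp behavior seen above, while the independent predictor sits near 1.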

Running an F-test, let us see which model is preferred. The F-test indicates that we prefer the smaller model.

anova(cor_model, full_day_model)
## Analysis of Variance Table
## 
## Model 1: cnt ~ registered + casual + temp + atemp + yr
## Model 2: cnt ~ season + yr + mnth + holiday + weekday + workingday + weathersit + 
##     temp + atemp + hum + windspeed + casual + registered
##   Res.Df      RSS Df Sum of Sq F Pr(>F)
## 1    725 7.61e-22                      
## 2    717 1.75e-21  8 -9.86e-22

Let's take cor_model and remove the two predictors with the highest VIFs. Then, rather than checking the remaining variables individually, we will perform an exhaustive search.

cor_model_no_high_vif = lm(cnt ~ registered + casual + yr, data = day_data_converted)
library(leaps)
all_cnt_mod = summary(regsubsets(cnt ~ ., data = day_data_converted, really.big = FALSE))
best_r2_ind = which.max(all_cnt_mod$adjr2)
# extract the predictor of the model
all_cnt_mod$which[best_r2_ind, ]
## (Intercept)      season          yr        mnth     holiday     weekday 
##        TRUE       FALSE       FALSE       FALSE       FALSE       FALSE 
##  workingday  weathersit        temp       atemp         hum   windspeed 
##       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE 
##      casual  registered 
##        TRUE        TRUE

We do not think we need to optimize for LOOCV-RMSE, since the model with the casual and registered predictors already has the highest adjusted \(R^2\).
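For a small pool of candidates, the exhaustive search can also be written directly in base R by fitting every subset and keeping the best adjusted \(R^2\); regsubsets() does the same far more efficiently for larger problems. A sketch with simulated data and illustrative names:

```r
set.seed(17)
n = 100
df = data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
df$y = 3 * df$a + rnorm(n)

predictors = c("a", "b", "c")
# Every non-empty subset of the candidate predictors
subsets = unlist(lapply(seq_along(predictors),
                        function(k) combn(predictors, k, simplify = FALSE)),
                 recursive = FALSE)

# Adjusted R^2 for each subset's model
adj_r2 = sapply(subsets, function(vars) {
  summary(lm(reformulate(vars, response = "y"), data = df))$adj.r.squared
})
subsets[[which.max(adj_r2)]]  # the informative predictor is always included
```

The number of subsets grows as \(2^p - 1\), which is why exhaustive search is only practical for small predictor pools.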

Compute Prediction Each Season

To compute the mean prediction for each season, we use the predictors identified in the previous step.

best_model = lm(cnt ~ registered + casual, data = day_data)
spring_pred = mean(predict(best_model, newdata = spring_data, interval = c("prediction"), level = 0.99))
summer_pred = mean(predict(best_model, newdata = summer_data, interval = c("prediction"), level = 0.99))
fall_pred = mean(predict(best_model, newdata = fall_data, interval = c("prediction"), level = 0.99))
winter_pred = mean(predict(best_model, newdata = winter_data, interval = c("prediction"), level = 0.99))
# create table for prediction results
pred_results = data.frame(
  "prediction" = c(
    "Spring" = spring_pred,
    "Summer" = summer_pred,
    "Fall"   = fall_pred,
    "Winter" = winter_pred
  )
)
kable(pred_results) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
       prediction
Spring       2604
Summer       4992
Fall         5644
Winter       4728
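A note on the averaging step above: predict() with an interval argument returns a three-column matrix (fit, lwr, upr), so mean() over the whole matrix averages all three columns together. Because lm prediction intervals are symmetric about the fit, this happens to equal the mean of the fit column, but indexing "fit" explicitly makes the intent clearer. A sketch with simulated data:

```r
set.seed(19)
n = 60
df = data.frame(x = runif(n))
df$y = 10 + 4 * df$x + rnorm(n)
fit = lm(y ~ x, data = df)

pred = predict(fit, newdata = df, interval = "prediction", level = 0.99)
head(pred, 2)  # columns: fit, lwr, upr

# Symmetric intervals: lwr + upr = 2 * fit, so the grand mean equals mean(fit)
all.equal(mean(pred), mean(pred[, "fit"]))  # TRUE
```

Averaging pred[, "fit"] directly avoids relying on this symmetry.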

Results


Use Case 1 - Operational Expenses Analysis

Our best model included both interactions and polynomial terms.

summary(bike_mod_all_11)$coef
##                                    Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)                       -1683.158     628.17 -2.67947 7.604e-03
## seasonSummer                       1007.283     514.23  1.95882 5.066e-02
## seasonFall                         5346.488    1355.40  3.94459 9.075e-05
## seasonWinter                       1875.157     977.30  1.91871 5.556e-02
## yr                                 1571.808     290.91  5.40307 9.936e-08
## mnthFeb                             -36.596     426.17 -0.08587 9.316e-01
## mnthMar                            -874.914     487.62 -1.79424 7.335e-02
## mnthApr                           -1649.058     850.65 -1.93860 5.308e-02
## mnthMay                            1594.910    1098.36  1.45208 1.471e-01
## mnthJun                            5073.201    1479.97  3.42790 6.558e-04
## mnthJul                            7675.643    2280.35  3.36600 8.183e-04
## mnthAug                             576.367    2076.58  0.27756 7.815e-01
## mnthSep                            -233.984    1569.62 -0.14907 8.816e-01
## mnthOct                            -248.871    1256.55 -0.19806 8.431e-01
## mnthNov                            -406.180    1189.34 -0.34152 7.328e-01
## mnthDec                            -947.636     981.89 -0.96511 3.349e-01
## holidayYes                        -4186.149    1605.72 -2.60702 9.392e-03
## weekdayMon                          205.959     138.54  1.48668 1.377e-01
## weekdayTue                          349.335     136.33  2.56237 1.067e-02
## weekdayWed                          404.042     140.25  2.88086 4.127e-03
## weekdayThu                          236.222     139.95  1.68793 9.202e-02
## weekdayFri                          430.350     141.75  3.03604 2.516e-03
## weekdaySat                          -16.410     130.45 -0.12579 8.999e-01
## weathersitMist                      551.344     350.46  1.57321 1.163e-01
## weathersitLightPrecip            -14720.917  155712.10 -0.09454 9.247e-01
## temp                               7183.428    1596.36  4.49987 8.373e-06
## hum                                3899.354    1665.27  2.34157 1.957e-02
## windspeed                          8690.057    1838.65  4.72632 2.937e-06
## I(temp^2)                         -5972.664    2799.15 -2.13374 3.333e-02
## I(hum^2)                          -2106.785    1306.74 -1.61224 1.075e-01
## I(windspeed^2)                    -9860.844    2280.57 -4.32385 1.834e-05
## seasonSummer:yr                    1285.263     264.18  4.86517 1.513e-06
## seasonFall:yr                       533.058     326.78  1.63125 1.034e-01
## seasonWinter:yr                     659.955     313.90  2.10244 3.599e-02
## seasonSummer:holidayYes            -872.834     771.16 -1.13185 2.582e-01
## seasonFall:holidayYes             -1920.061    1083.22 -1.77255 7.688e-02
## seasonSummer:weekdayMon            -719.253     174.75 -4.11580 4.477e-05
## seasonFall:weekdayMon              -355.732     172.03 -2.06790 3.914e-02
## seasonWinter:weekdayMon            -242.071     174.39 -1.38808 1.657e-01
## seasonSummer:weekdayTue            -647.481     170.67 -3.79378 1.656e-04
## seasonFall:weekdayTue              -160.706     165.72 -0.96974 3.326e-01
## seasonWinter:weekdayTue               9.149     172.05  0.05318 9.576e-01
## seasonSummer:weekdayWed            -710.701     165.60 -4.29155 2.112e-05
## seasonFall:weekdayWed              -302.930     169.48 -1.78740 7.445e-02
## seasonWinter:weekdayWed            -209.656     173.69 -1.20705 2.280e-01
## seasonSummer:weekdayThu            -476.993     167.15 -2.85376 4.490e-03
## seasonFall:weekdayThu              -252.243     166.95 -1.51093 1.314e-01
## seasonWinter:weekdayThu            -107.387     177.17 -0.60613 5.447e-01
## seasonSummer:weekdayFri            -408.567     170.47 -2.39665 1.689e-02
## seasonFall:weekdayFri              -308.933     165.83 -1.86301 6.302e-02
## seasonWinter:weekdayFri            -229.444     170.85 -1.34292 1.799e-01
## seasonSummer:weekdaySat             231.030     165.28  1.39781 1.628e-01
## seasonFall:weekdaySat               263.169     163.60  1.60862 1.083e-01
## seasonWinter:weekdaySat             371.076     164.71  2.25296 2.467e-02
## seasonSummer:weathersitMist        -722.271     329.65 -2.19100 2.889e-02
## seasonFall:weathersitMist          -563.892     386.75 -1.45804 1.454e-01
## seasonWinter:weathersitMist        -627.249     330.50 -1.89787 5.826e-02
## seasonFall:weathersitLightPrecip  -1223.564   28423.50 -0.04305 9.657e-01
## seasonSummer:temp                 -4982.990    1674.57 -2.97568 3.058e-03
## seasonFall:temp                  -10478.551    2520.38 -4.15753 3.755e-05
## seasonWinter:temp                 -4927.243    1542.84 -3.19361 1.489e-03
## seasonSummer:hum                   2510.785    1053.22  2.38392 1.748e-02
## seasonFall:hum                     1984.759    1633.72  1.21487 2.250e-01
## seasonWinter:hum                   2244.421    1554.85  1.44349 1.495e-01
## yr:mnthFeb                          -23.033     176.96 -0.13016 8.965e-01
## yr:mnthMar                          161.434     214.44  0.75280 4.519e-01
## yr:mnthApr                         -933.973     348.40 -2.68078 7.575e-03
## yr:mnthMay                        -1730.267     378.15 -4.57566 5.926e-06
## yr:mnthJun                        -1460.192     392.56 -3.71966 2.209e-04
## yr:mnthJul                         -906.095     447.75 -2.02364 4.351e-02
## yr:mnthAug                         -394.949     430.68 -0.91705 3.595e-01
## yr:mnthSep                         -336.953     403.59 -0.83490 4.042e-01
## yr:mnthOct                          -25.876     382.07 -0.06772 9.460e-01
## yr:mnthNov                         -519.131     374.19 -1.38734 1.659e-01
## yr:mnthDec                         -600.743     317.58 -1.89162 5.909e-02
## yr:weekdayMon                       447.472     119.09  3.75738 1.909e-04
## yr:weekdayTue                       357.617     119.33  2.99684 2.857e-03
## yr:weekdayWed                       427.096     121.77  3.50737 4.912e-04
## yr:weekdayThu                       655.717     121.02  5.41841 9.162e-08
## yr:weekdayFri                       525.160     118.57  4.42901 1.152e-05
## yr:weekdaySat                       583.690     117.20  4.98020 8.622e-07
## yr:weathersitMist                  -289.958      94.12 -3.08079 2.172e-03
## yr:temp                            1712.949     558.46  3.06726 2.271e-03
## yr:hum                             -968.212     371.61 -2.60549 9.434e-03
## yr:windspeed                       -942.668     487.55 -1.93348 5.371e-02
## mnthFeb:weathersitMist             -113.545     210.18 -0.54023 5.893e-01
## mnthMar:weathersitMist             -165.057     263.33 -0.62682 5.311e-01
## mnthApr:weathersitMist             -344.206     419.60 -0.82032 4.124e-01
## mnthMay:weathersitMist              405.749     428.55  0.94679 3.442e-01
## mnthJun:weathersitMist             -286.810     446.78 -0.64196 5.212e-01
## mnthJul:weathersitMist             -330.613     503.78 -0.65626 5.119e-01
## mnthAug:weathersitMist              -75.573     490.25 -0.15415 8.775e-01
## mnthSep:weathersitMist             -262.345     444.44 -0.59028 5.553e-01
## mnthOct:weathersitMist             -121.264     406.19 -0.29854 7.654e-01
## mnthNov:weathersitMist              403.136     387.55  1.04021 2.987e-01
## mnthDec:weathersitMist              544.194     340.08  1.60020 1.102e-01
## mnthOct:weathersitLightPrecip     -2160.039   29281.34 -0.07377 9.412e-01
## mnthFeb:temp                       -537.529    1124.60 -0.47797 6.329e-01
## mnthMar:temp                       3187.552    1393.64  2.28721 2.258e-02
## mnthApr:temp                       8776.006    2373.97  3.69677 2.412e-04
## mnthMay:temp                       7050.073    2851.35  2.47254 1.373e-02
## mnthJun:temp                       1241.808    3251.55  0.38191 7.027e-01
## mnthJul:temp                      -2171.925    4051.26 -0.53611 5.921e-01
## mnthAug:temp                       7227.527    3819.79  1.89213 5.902e-02
## mnthSep:temp                       9624.696    3315.20  2.90320 3.848e-03
## mnthOct:temp                       7343.715    2337.15  3.14217 1.771e-03
## mnthNov:temp                       4487.196    2367.01  1.89573 5.854e-02
## mnthDec:temp                       5835.861    1806.21  3.23100 1.311e-03
## mnthFeb:hum                         646.600     723.38  0.89385 3.718e-01
## mnthMar:hum                         382.661     828.31  0.46198 6.443e-01
## mnthApr:hum                       -1073.359    1289.66 -0.83228 4.056e-01
## mnthMay:hum                       -4150.273    1335.76 -3.10706 1.991e-03
## mnthJun:hum                       -3742.575    1339.23 -2.79457 5.386e-03
## mnthJul:hum                       -3628.858    1832.90 -1.97984 4.824e-02
## mnthAug:hum                       -3939.536    1854.50 -2.12431 3.411e-02
## mnthSep:hum                       -4590.272    1874.44 -2.44887 1.466e-02
## mnthOct:hum                       -2888.401    1841.23 -1.56874 1.173e-01
## mnthNov:hum                       -1784.942    1767.64 -1.00979 3.131e-01
## mnthDec:hum                       -1577.334    1554.63 -1.01460 3.108e-01
## holidayYes:hum                     6725.199    3288.04  2.04535 4.132e-02
## weekdayMon:weathersitMist           200.223     135.30  1.47984 1.395e-01
## weekdayTue:weathersitMist           -69.502     135.05 -0.51464 6.070e-01
## weekdayWed:weathersitMist           -47.747     135.91 -0.35131 7.255e-01
## weekdayThu:weathersitMist            29.179     134.84  0.21640 8.288e-01
## weekdayFri:weathersitMist          -159.464     136.16 -1.17113 2.421e-01
## weekdaySat:weathersitMist          -162.242     137.11 -1.18329 2.372e-01
## weekdayMon:weathersitLightPrecip   1433.004    3176.30  0.45116 6.521e-01
## weathersitMist:temp                1859.348     556.17  3.34314 8.872e-04
## weathersitLightPrecip:temp        27925.347  340979.07  0.08190 9.348e-01
## weathersitMist:hum                -1743.948     543.31 -3.20988 1.409e-03
## hum:windspeed                     -9939.237    1881.87 -5.28158 1.877e-07

It uses 151 predictors, which is quite a lot, and the p-values for many of them indicate that they may not be significant. However, the non-significant terms are mostly individual levels of factor variables, which are difficult to remove without dropping the entire factor.
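The predictor count and the number of non-significant terms can be checked directly from the fitted object; a minimal sketch, assuming `bike_mod_all_11` is the fitted `lm` object from above:

```r
# Number of estimated coefficients, excluding the intercept
length(coef(bike_mod_all_11)) - 1

# Number of terms that are not significant at alpha = 0.05
sum(summary(bike_mod_all_11)$coefficients[, "Pr(>|t|)"] > 0.05)
```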

plot(
  bike_mod_all_11$fitted.values,
  data_3$cnt[filter],
  main = "Fitted Values vs Actual",
  xlab = "Fitted Values",
  ylab = "Ground Truth",
  col = "darkblue",
  pch = 20
)
abline(0, 1, col = "firebrick4", lwd = 3)

We also observed that the fitted values for this model are quite close to the actual values.

summary(bike_mod_all_11)$adj.r.squared
## [1] 0.9582

The adjusted \(R^2\) indicates that the model explains about 95.82% of the variance in the data.

The LOOCV-RMSE of the model is 426.8036.
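For a linear model, the LOOCV-RMSE can be computed without refitting the model \(n\) times by using the leave-one-out shortcut based on the leverage (hat) values; a minimal sketch, assuming `bike_mod_all_11` is the fitted `lm` object:

```r
# LOOCV-RMSE shortcut for linear models: the i-th leave-one-out
# residual equals e_i / (1 - h_i), where e_i is the ordinary
# residual and h_i the i-th leverage (hat) value.
calc_loocv_rmse <- function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}

calc_loocv_rmse(bike_mod_all_11)
```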

par(mfrow = c(2, 2))
plot(bike_mod_all_11, col = "dodgerblue")  # diagnostic plots

The diagnostic plots all appear to be acceptable, although there are still some points with very high leverage.

sqrt(mean(bike_mod_all_11$residuals^2))
## [1] 343.8

The model has a training RMSE of about 344, which means that, on average, the model's predictions are off from the observed counts by roughly that many rentals.

# Count the number of high VIFs in the model
library(car)
# car::vif(bike_mod_all_11)

#sum(car::vif(bike_mod_all_11) > 10)

Many of the model's predictors have high VIFs. This was expected given the large number of categorical variables and their interactions. Since we are mainly focused on demand prediction, we tuned the model for predictive performance at the expense of interpretability.

hist(resid(bike_mod_all_11), 
     col = "lightslateblue", 
     main = "Histogram of Residuals", xlab = "Residuals")

The histogram of residuals looks approximately normal, suggesting that the model's errors are normally distributed and centered around 0.

So far, we have tried multiple approaches to evolve our understanding of what it takes to build a model with good performance while still capturing the high variance in the data. At this point, we have reached a good compromise: a model with a high adjusted \(R^2\) and a low LOOCV-RMSE.

Use Case 2 - Targeted Marketing Strategy Analysis

Based on the F-test, we preferred the smaller model obtained from the collinearity check, which uses the predictors registered, casual, temp, atemp, and yr. Starting from that model, we searched for the best model by removing the predictor with the highest VIF and running an exhaustive search over every possible subset of the remaining predictors. We ended up using the registered and casual variables to compute the predictions for each season.
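The exhaustive search described above can be sketched with the leaps package; a minimal sketch, assuming `day_data` is the daily data frame and that the candidate set matches the five predictors named above (the exact call the team used is an assumption here):

```r
library(leaps)

# Exhaustive best-subset search over the candidate predictors
all_mods <- summary(regsubsets(cnt ~ registered + casual + temp + atemp + yr,
                               data = day_data))

# Choose the subset size with the highest adjusted R^2,
# then inspect which predictors it includes
best_size <- which.max(all_mods$adjr2)
all_mods$which[best_size, ]
```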


Discussion


Use Case 1 - Operational Expenses Analysis

The final model contains a large number of predictors, which makes it difficult to interpret, but the results indicate that it is very good at explaining the relationship between weather, season, and bike-sharing rentals. We also tried BIC-based selection to reduce the size of the final model; however, doing so resulted in a worse LOOCV-RMSE, so we preferred the AIC-selected model.

As expected, the weather situation is an especially important predictor, both on its own and through its interactions with other predictors, indicating that rain has a significant impact on the number of rentals; the interaction between light rain and windspeed is especially notable.

We have some concerns about the inclusion of the year as a predictor as it may cause problems with using the model for inference on unseen data. While initially it was used as a categorical predictor, we changed it to a numeric predictor in the hopes that it would make the model generalize better to data from the future, should that ever be required. We attempted to model the data excluding year, however, the general increasing ridership between the years proved to be an important factor. Future work could incorporate data from additional years to see if this trend continues.

The high adjusted \(R^2\) of the model indicates that a very large portion of the variance in the data can be explained by this model, which would make it very useful for predicting demand for bikes.

Future Improvements:

  1. Looking at the distribution of demand across types of day, as well as at the outliers detected through Cook's distance, it might be worthwhile to build separate models for holidays, workdays, and weekends.

  2. We could extend this model with a seasonality analysis to see how the season and the day of the week affect the response, and analyze the overall year-on-year demand trend after removing the seasonal component from the data.

Use Case 2 - Targeted Marketing Strategy Analysis

Our preliminary assumption was that Winter would be the season in need of an enhanced marketing strategy, but upon finding the best model, it turned out that Spring is the season that would require stronger marketing to attract additional customers. Interrogating the best model, best_model = lm(cnt ~ registered + casual, data = day_data), we note that the value of cnt is the sum of registered and casual bike riders, which makes these predictors highly correlated with the response.

If given additional time, we would like to explore why the Spring season received the lowest prediction, contrary to our initial assumption of Winter. Are there methods that would lead us to a different best model? Are people biking more during Winter because they enjoy the excitement of snow and cold temperatures? It would also be interesting to gather data on how often a customer rents a bike, a comparison of registered versus casual riders, and the type of bicycle – mountain bike, fat bike, or road bike.


Appendix

This report was prepared by Team Outlier – Michael Alpas, Mohit Singh and Upendra Yadav.